Optimizing Stencil Computations for NVIDIA Kepler GPUs
Authors
Abstract
We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations on regular grids have been ported to older generations of NVIDIA GPUs with significant performance improvements, thanks to the higher memory bandwidth of GPUs compared with conventional CPU-only systems. However, because of the architectural changes introduced with Kepler, the latest generation of the GPU architecture, we show that implementation strategies used for those older GPUs are less effective on Kepler than they were before. To fully exploit the potential performance of Kepler, our implementation method uses shared memory for better data locality, combined with warp specialization for higher instruction throughput. Our method achieves approximately 80% of the peak performance estimated by the roofline model, and even higher performance with temporal blocking.
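As a rough illustration of the roofline estimate mentioned in the abstract, the sketch below computes the attainable performance as the minimum of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The hardware figures and the per-point FLOP and byte counts are hypothetical placeholders chosen for round numbers; they are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_point, bytes_per_point):
    """Attainable GFLOP/s = min(compute peak, bandwidth * arithmetic intensity)."""
    intensity = flops_per_point / bytes_per_point  # FLOP per byte of DRAM traffic
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical placeholder values (assumptions, not measurements from the paper).
PEAK_GFLOPS = 4000.0    # assumed single-precision compute peak
BANDWIDTH_GBS = 250.0   # assumed DRAM bandwidth in GB/s
FLOPS_PER_POINT = 8     # assumed FLOP count for a small 3D stencil
BYTES_PER_POINT = 8     # one 4-byte read plus one 4-byte write, assuming ideal reuse

bound = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_POINT, BYTES_PER_POINT)
print(f"roofline bound: {bound:.1f} GFLOP/s")
print(f"80% of bound:   {0.8 * bound:.1f} GFLOP/s")
```

With these placeholder numbers the kernel is bandwidth-bound, which is the usual situation for low-order stencils and is why data locality in shared memory matters.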
Similar resources
Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil
Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits their performance and power efficiency. In this paper, we accelerate the forward-modeling technique on the latest multi-co...
Performance of Kepler GTX Titan GPUs and Xeon Phi System
NVIDIA’s new architecture, Kepler, significantly improves GPU performance with the new streaming multiprocessor, SMX. Along with the performance gains, NVIDIA has also introduced new technologies such as Dynamic Parallelism, Hyper-Q and GPUDirect with RDMA. Alongside its other GPUs, NVIDIA also released another Kepler ‘GeForce’ GPU named GTX Titan. GeForce GTX Titan is not only good for gamin...
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as ‘Kepler’. We provide a review of previou...
Early Experiences in Running Many-Task Computing Workloads on GPGPUs
This work aims to enable Swift to efficiently use accelerators (such as NVIDIA GPUs) to further accelerate a wide range of applications. It presents preliminary results on the costs associated with managing and launching concurrent kernels on NVIDIA Kepler GPUs. We expect our results to be applicable to several XSEDE resources, such as Forge, Keeneland, and Lonestar, where currently Swif...
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
Lattice Quantum Chromodynamics simulations typically spend most of the runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel® Xeon Phi™ to current Kepler-based NVIDIA® Tesla™ GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverti...